Using machine learning to detect coronaviruses potentially infectious to humans

Establishing the host range for novel viruses remains a challenge. Here, we address the challenge of identifying non-human animal coronaviruses that may infect humans by creating an artificial neural network model that learns from spike protein sequences of alpha and beta coronaviruses and their binding annotation to their host receptor. The proposed method produces a human-Binding Potential (h-BiP) score that distinguishes, with high accuracy, the binding potential among coronaviruses. Three viruses, previously unknown to bind human receptors, were identified: Bat coronavirus BtCoV/133/2005 and Pipistrellus abramus bat coronavirus HKU5-related (both MERS related viruses), and Rhinolophus affinis coronavirus isolate LYRa3 (a SARS related virus). We further analyze the binding properties of BtCoV/133/2005 and LYRa3 using molecular dynamics. To test whether this model can be used for surveillance of novel coronaviruses, we re-trained the model on a set that excludes SARS-CoV-2 and all viral sequences released after the SARS-CoV-2 was published. The results predict the binding of SARS-CoV-2 with a human receptor, indicating that machine learning methods are an excellent tool for the prediction of host expansion events.

www.nature.com/scientificreports/ of representatives of each taxa across hosts and disregard the specific properties of the virus, preventing further mechanistic analysis of host expansion pathways.
In this work, we study the potential of alpha and beta coronaviruses to cause human infection. In particular, we aim at predicting whether the spike (S) protein of a coronavirus binds a human receptor. The S protein decorates the exterior of the viral envelope and is key in host expansion since its binding to the host receptor protein triggers the infection process [22][23][24] . Starting with a collection of amino acid sequences from the S protein, we build a machine learning model that predicts binding to a human host receptor. We propose a skip-gram model which uses a neural network to transform the sequences into vectors. These vectors encode the relationship between neighboring protein sequences of length k (i.e. k-mers). A classifier uses these vectors to score each sequence according to its binding potential to a human receptor. We call this score the human-binding potential (h-BiP). We use a dataset consisting of 2,534 unique spike sequences from alpha and beta coronaviruses spanning all clades and variants (see "Methods"). The classifier is highly accurate, and its h-BiP score is highly correlated with sequence identity against human viruses. Moreover, the proposed h-BiP score also discriminates the binding potential in cases with similar sequence identity and detects binding in cases of low sequence identity. We identify three viruses, Bt133 25 , Pipistrellus abramus bat coronavirus HKU5-related (HKU5r) 26 and LYRa3 27 , with high h-BiP values and yet unknown human binding properties. Consistent with this finding, a phylogenetic analysis shows that Bt133, HKU5r and LyRa3 are related to non-human viruses known to bind human receptors. Furthermore, a multiple sequence alignment of the receptor binding motifs (RBM) of Bt133 and of LYRa3 with their related viruses revealed that they conserve the contact residues with the human receptor. Molecular dynamics (MD) of the receptor binding domain (RBD) validates binding and identifies contact residues with human receptors. Finally, we test whether this model can be used for the surveillance of host expansion events. We emulate the conditions prior to SARS-CoV-2 emergence by excluding from the training set all coronavirus sequences published after December 31 st , 2019 and find that the re-trained model predicts binding of the wild type of SARS-CoV-2 to a human receptor.

Results
h-BiP: a machine learning approach for scoring human-binding potential of coronavirus sequences. We propose a human-Binding Potential (h-BiP) score that assigns a value between 0 and 1 to spike proteins of alpha and beta coronaviruses. The outline of the method to obtain h-BiP is shown in Fig. 1, and a full description is available in "Methods" section. First, we partitioned each protein sequence as a sequence of trimers, which we used to produce trimer embeddings in a high-dimensional Euclidian space. We selected a skip-gram model 28 to ensure that each trimer embedding was informed by neighboring trimers within a context window. Next, an embedding for the entire protein sequence was generated by adding all of its trimer embeddings. Each coronavirus, together with all its available variants in the dataset, was labeled as positive or negative according to its published binding annotation to any, known or unknown, human receptor (Table 1). To produce the h-BiP score, we built a logistic regression classifier on the sequence embeddings to predict binding. Viruses with h-BiP score of 0.5 or higher were classified as likely to bind a human receptor.  Methodological workflow of the human Binding Potential (h-BiP) score. Left: preprocessing sequences from alpha and beta coronaviruses. Top: whether the S protein was available from annotation or by extraction from whole-genome, the dataset consists of 2534 unique S protein sequences. Each protein sequence is transformed into a trimer (3 amino acid) representation by sliding a window one amino acid at a time. Bottom: we curated the host field and annotated the sequences according to their binding status to human receptors. Regardless of the host, a virus is considered positive for binding if there is experimental evidence of binding to a human receptor. Right: a skip-gram model uses a neural network to generate trimer embeddings of a fixed dimension (d = 100). These trimer embeddings are numerical vectors that encode information from all neighboring trimers within a context window in the protein sequence. Next, we compute the final sequence embedding (d = 100) by adding up all of its trimer embeddings. The scatterplot shows a visualization for the embeddings from all viruses after using t-distributed stochastic neighbor embedding (tsne) to reduce dimensionality. Finally, all sequence embeddings feed a classifier (logistic regression) to produce the h-BiP score that learns from the binding information of alpha and beta coronaviruses. An h-BiP score greater than or equal to 0.5 flags the virus as likely for human binding.  Table 2 shows the prediction results for the full data set. The method achieved 99% accuracy, 99% sensitivity and 99% specificity in the test data set (see Supplementary Fig. S1). A total of 805 sequences had an h-BiP score less than 0.5. Of these, 796 were true negatives. While binding status was unknown for most of these sequences, viruses such as BM48-31 and Rf1 for which experimental studies found no evidence of binding to a human receptor 38 were confirmed as non-binding by h-BiP (i.e. h-BiP score < 0.5). Three viruses with unknown binding status had an h-BiP score ≥ 0.5, suggesting they may potentially bind human receptors. Bat coronavirus BtCoV/133/2005 (Bt133), Pipistrellus abramus bat coronavirus HKU5-related (HKU5r) and Rhinolophus affinis coronavirus isolate LYRa3. Nine viruses were classified as false negatives (see Table 3). The h-BiP scores of these viruses ranged from 0.13 to 0.44.

Binding annotation to human receptor
The h-BiP score is consistent with sequence identity and expands viral classification. We verified that the proposed model, while consistent with sequence alignment results, provides additional information. Percent sequence identity (% identity) of a newly detected virus with known human viruses 10, 18 is often used to assess human infectivity. We computed the pairwise % identity between each of the 7 human coronaviruses and the S protein sequences in our dataset and selected the maximum for each sequence. All cases with 93 or higher protein % identity with known human coronaviruses had a h-BiP score greater than 0.5. Furthermore, the Pear-  www.nature.com/scientificreports/ son correlation between these maximum % identities and our proposed h-BiP score was 0.96 ( Fig. 2). Hence, we conclude that our method is consistent with standard sequence identity approaches. Our study identified 26 bat viruses with maximum % identity < 83% and an h-BiP ≥ 0.5. There is experimental evidence that these viruses bind to a human receptor, except for Bt133 and HKU5r (Table 1). A protein BLAST 43 against all coronaviruses confirmed that Bt133 and HKU5r have a low % identity with any human coronavirus. The h-BiP score also discriminates between viruses that have similar % identity. For instance, the spike proteins from bovine and pangolin coronaviruses have % identities that range from 90 to 92% (see Fig. 2). The h-BiP scores for pangolin coronaviruses and bovine coronaviruses are greater than 0.97 and less than 0.4, respectively. This finding agrees with current experimental results that show binding of pangolin coronaviruses to human receptors 30 . To our knowledge, only the alpaca isolate from Bovine coronaviruses 44 has been reported for human infections.
The h-BiP score predicts that the S protein of viruses Bt133, HKU5r and LYRa3 bind to human receptors. Our method assigned a h-BiP score > 0.5 to Bt133, HKU5r and LYRa3, suggesting that these three viruses with unknown human binding status may bind to a human receptor. Bt133 is a beta coronavirus from the Merbecovirus subgenus, and it is phylogenetically related to Ty-HKU4 (see Fig. 3a), a bat coronavirus for which there is experimental evidence of binding to human receptor dipeptidyl peptidase 4 (hDPP4) 24,36 . HKU5r also belongs to the Merbecovirus subgenus and it is phylogenetically related to HKU5 (Fig. 3a). While experimental evidence shows that HKU5 binds human cells, it is known that it does not bind hDPP4 24 . In fact, the specific human receptor of HKU5 still remains unknown 40 . LYRa3 is a beta coronavirus from the Sarbecovirus subgenus, and it is phylogenetically related to Rhinolophus affinis coronavirus isolate LYRa11 27 (see Fig. 3b), which binds human receptor angiotensin-converting enzyme 2 (hACE2) 29,31 . Host recognition and cell entry is mediated by the S protein [22][23][24] . The high sequence identity (97%) that the S protein of Bt133 shares with that of Ty-HKU4 suggests that Bt133 binds to hDPP4. Similarly, a 99% S protein sequence identity between LYRa3 and LYRa11 suggests that LYRa3 binds to hACE2. The S protein of HKU5r has a 93% sequence identity with that of HKU5. This high % sequence identity between HKU5 and HKU5r together with the high h-BiP value, suggests that HKU5r binds a human receptor different from hDDP4.
The S protein binds to the human receptor through the receptor binding domain (RBD). The RBD is composed of a core domain and the receptor binding motif (RBM) that comes in direct contact with the host receptor 24 . A multiple sequence alignment of Bt133 with typical members of the Merbecovirus subgenus at the RBM (see Fig. 4a) revealed that Bt133 conserves all 8 contact residues used by Ty-HKU4 to bind hDPP4 24,36 . Thus, suggesting that Bt133 binds hDPP4. Figure 4a also shows a region of the HKU5 genome that aligns with the RBM of Ty-HKU4. When compared to Ty-HKU4, this region of HKU5 shows two deletions and different amino acids at the contact residues. These observations are consistent with the lack of binding of HKU5 to hDPP4.
LYRa3 and LYRa11 are phylogenetically related to SARS-CoV Tor2, and both share 89.7% sequence identity at the S gene. A multiple sequence alignment for related viruses within the Sarbecovirus subgenus shows that LYRa11 and LYRa3 are identical at the RBM except at residue H441. Twelve of the 17 contact residues that SARS-CoV Tor2 uses to bind to hACE2 are conserved (see Fig. 4b). In contrast, experimental studies from bat Comparison of sequence % identity and h-BiP score for alpha and beta coronaviruses. The x-axis represents the maximum % identity computed from a particular virus against the seven known human coronaviruses. The y-axis shows the h-BiP score. Each point in the graph represents a sequence in the dataset. Regardless of their host, red crosses depict sequences of viruses known to bind a human receptor, and grey points represent those viruses not known to do so. Points above the blue dashed horizontal line have a h-BiP score greater or equal than 0.5 (i.e. positive for binding). The blue dashed vertical line is the 97% identity reference line. The spike protein of bat coronavirus RaTG13 (depicted with a red star) is known to bind to human receptor hACE2 and it has a 97.46% amino acid identity against SARS-CoV-2 and a 0.999 h-BiP score. Three viruses with h-BiP ≥ 0.5 and yet unknown binding, Bt133, HKU5r and LYRa3 are highlighted. Bat|Rc-o319|0 Bat|BM48-31/|0 Human, civet, mink|SARS-CoV|1 Human, feline, minks|SARS-CoV-2|1 Pangolin|PCoV-GX|1    www.nature.com/scientificreports/ coronavirus ZC45 45 did not find evidence of binding to hACE2 29,31 . When compared to SAR-CoV Tor2, ZC45 shows three deletions at the RBM and different amino acids at all locations of the contact residues of SAR-CoV Tor2, except for one at Y449. These results suggest that LYRa3 binds to hACE2.

Molecular dynamics confirm binding of S to human receptors for coronaviruses Bt133 and
LYRa3. Phylogenetic analysis, alignment results and the h-BiP score suggest that bat coronaviruses Bt133 and LYRa3 potentially bind to human receptors and are, therefore, candidates for host expansion to humans. We then used molecular dynamics simulations (MD) to validate binding in silico and to determine their contact residues. Three-dimensional structures of the receptor binding domain (RBD) bound to their corresponding human receptor for Lyra3 and Bt133 were obtained from crystal structure data and molecular modeling (see "Methods"). The S protein of LYRa3 shares 99% identity with LYRa11 with one single point mutation at the RBD. Since LYRa11 is known to bind hACE2 29, 31 , we expect LYRa3 to have similar binding properties. Results from three independent simulations showed that the average H-bond count between the RBD of LYRa3 and hACE2 was 5.1 (see Table 4). Similarly, the average number of H-bonds between the RBD of LYRa11 and hACE2 was 5.8 H-bonds. The difference in the number of H-bonds between LYRa3 and LYRa11 was not statistically significant (p-value: 0.2947), suggesting that they have comparable binding energies. Two different estimations for the binding free energy (see "Methods") of both complexes confirmed that the difference was not statistically significant (HawkDock p-value: 0.7599). A breakdown of the different types of interactions is available at Supplementary Table S1 (PRODIGY 46 ) and Supplementary Table S2 (HawkDock 47 ). The contact residues for LYRa3 were unknown. Therefore, we identified all of its contact residues during the course of the simulation (Supplementary  Table S3). Our simulations revealed that contact residues G492, N477, T490, G486 and Y485 were present in at least 45% of the sampled conformations. A comprehensive list of all the bonds and their respective frequencies are available in (Supplementary Table S3).
Next, we analyzed the interaction between the S protein of Bt133 and hDPP4. Despite containing 13 mutations in its RBD, alignment at the RBM showed that the contact residues of Ty-HKU4 are also present in Bt133. Results from three independent simulations showed that the average count of H-bonds was 5.4 for Bt133 and 6.0 for Ty-HKU4 (Table 4). The difference in average values was not statistically significant (p-value: 0.3780), suggesting that Bt133 and Ty-HKU4 have comparable binding energies when bound to hDPP4. The binding free energy from the two estimations of both complexes (Supplementary Tables S1 and S2) confirmed that the difference was not statistically significant (HawkDock p-value: 0.6009).
Our simulations confirmed all reported contact residues between Ty-HKU4 and hDPP4 24 except for Q544. Our simulations also revealed contact residues N468, S465 and Y460, which have not been previously reported (Supplementary Table S4). Figure 5 shows contact residues for both Ty-HKU4 (a) and Bt133 (b). Only two contact residues were frequently observed in both viruses. A bond between E518 (magenta) in the RBD and Q344 Table 4. Average H-bonds between RBD and human receptor. Average number of H-bonds between the RBD and the human receptor by MD simulation (Sim1, Sim2, Sim3). The average was computed from all available sampled conformations after reaching equilibrium (4 ns). The average number of H-bonds of the three simulations and corresponding standard error is also shown. A t-test was performed to compare the means between LYRa11 and LYRa3 and between Ty-HKU4 and Bt133. The p-value corresponds to the two tailed test. www.nature.com/scientificreports/ in hDPP4, and between N514 (red) in the RBD and R317 in hDPP4, were present in at least 94% of the sampled conformations. Two additional contact residues were present in at least 50% of the sampled conformations from Ty-HKU4: K506 (purple) and K547 (orange). However, these two contact residues were present in less than 39% sampled conformations of Bt133 (Supplementary Table S5). Instead, only one additional contact residue was present for Bt133 in more than 70% of the sampled conformations: Q515 (yellow).

The h-BiP score predicts binding of viruses that do not use the canonical receptor. Some coro-
naviruses are known not to bind canonical receptors 24,39 . For instance, two recently reported MERS-related coronaviruses PDF2180 and NeoCoV have been shown to bind human ACE2 instead of hDPP4 39 . The h-BiP scores for these two betacoronaviruses were 0.95 and 0.88 respectively. These results combined with an h-BiP score of 0.98 for HKU5 (positive for binding) indicates that h-BiP can properly classify binding of coronaviruses to human receptor independently of the receptor preference of the virus.
The h-BiP score predicts binding of SARS-CoV2 to hACE2. We explored whether the proposed method can be used to detect potential human-infection of novel coronaviruses. To test this hypothesis, we repeated the study and computed a h-BiP score for SARS-CoV-2 after excluding all SARS-CoV-2 viruses and all viral sequences uploaded to the database after Dec. 31, 2019 from the training set. This new data set consisted of 1272 viruses with 540 labeled as positive for binding to a human receptor. In this case the h-BiP value of SARS-CoV2 was 0.62. The model had sensitivity and specificity values of 99%. We conclude the proposed method would have predicted binding of SARS-CoV-2 to a human receptor in the early stages of the pandemic.

Discussion
The COVID-19 pandemic has demonstrated the need to develop tools to predict spillover events. At the cellular and molecular level, three key steps, which are yet to be fully understood, are required for a spillover event: the evasion of the host's immune system by the virus 4 , the infection of the cell by the virus 23,24 and the replication of the virus in the new host cell 4,23 . In coronaviruses, the infection event is mediated by the binding of the viral spike (S) protein to the host receptor 23,24 . The vast abundance of recently collected S protein data opens the way for machine learning (ML) methods that will help accelerate the pace of discovery of virus candidates for spillover events.
Here, we propose a new machine learning approach to predict the binding of alpha and beta coronaviruses to a human receptor. We call this model the human-Binding Potential (h-BiP). As an alignment-free approach, h-BiP overcomes limitations associated with multiple sequence alignment (msa) methods, such as the lack of robustness against genomic rearrangements or low computational efficiency (O(nm) for msa) 19 . While most alignment-free approaches rely on k-mer counts, which disregards their relative position in the genome 14 , h-BiP uses a neural network to create k-mer embeddings through a context window. These embeddings encode information about their neighboring k-mers in the sequence, and the resultant classifier is highly accurate.
Viruses with known human receptor are overrepresented in the group of viruses labeled as having evidence of binding and consequently the method has an inherent bias to correctly classify those viruses with known human receptor. To account for this bias, we included viruses with unknown human receptor. Our method benefits from this information which is in contrast with msa and biophysical methods that require knowledge on the position of the RBM or binding residues.
The h-BiP score is highly correlated with % identity against human coronaviruses. Yet, the classifier discriminates among viruses with similar % identity and identifies viruses with low % identity that may bind to human receptors. Such is the case of bat coronavirus LYRa3, which has a similar % identity value to other viruses with unknown binding status to human receptor such as HKU14; and the cases of Bt133 and Pipistrellus abramus HKU5-related virus (HKU5r), with low % identity (67.2% and 64.1% resp.). However, the three of them have an h-BiP score greater than 0.5. The method suggests that these viruses with previously unknown binding status may bind to a human receptor. Phylogenetic analysis revealed that they are closely related to viruses known to bind human receptors. A pairwise alignment between the RBDs of Ty-HKU4 and Bt133 revealed that Bt133 conserves all 8 contact residues used by Ty-HKU4 to bind hDPP4 despite having 13 mutations in the RBD. On the other hand, with the exception of residue 441, the protein sequences of LYRa3 and LYRa11 are identical at the RBM. These findings suggest that Bt133 and LYRa11 bind to hDPP4 and hACE2, respectively.
We detected HKU5r, with a h-BiP score of 0.99. HKU5r has a 93% identity in the S protein with HKU5. Interestingly, HKU5 is a Merbecovirus for which the human receptor is unknown and it has been shown not to bind to hDPP4 24 . This is one example on how the prediction of our method improves when we add viruses for which the only information known is whether they bind or not to any human receptor. The absence of a known human receptor for HKU5 limits the possibility of further biophysical studies of binding hence we did not pursue them here.
We confirmed binding of Bt133 and LYRa3 in silico through a combination of structural modeling and molecular dynamics simulations. The average number of contact residues from the simulations of both viruses was not statistically different from that of Ty-HKU4 and LYRa11, respectively. We used HawkDock and PROD-IGY to estimate binding energies. We selected HawkDock's molecular mechanics/generalized Born surface area (MM/GBSA) for the calculations. MM/GBSA overestimated the magnitude of binding free energy but the obtained values correlate well with H-bond counts 48 . While PRODIGY results provided closer estimates for binding energies measured in in-vitro studies 48 , it did not discriminate well among variants. The binding energy estimation from Bt133 and LYRa3 was not statistically different from that of Ty-HKU4 and LYRa11, respectively. These results suggest they have comparable binding energies. We identified all contact residues of Bt133 and Ty-HKU4 (Supplementary Tables S4 and S5). Only contact residues E518 and N514 were frequently used (> 94% www.nature.com/scientificreports/ of sampled conformations) by both viruses to bind to the human receptor. Contact residue E518 is also known to be relevant for binding of MERS 24, 36 to hDPP4. Our simulations also revealed three previously unreported contact residues from Ty-HKU4 (N468, S465 and Y460). Our study also identified the contact residues of LYRa3 (Supplementary Table S3). As expected with any classifier, h-BiP produced a few false negatives in our dataset. Such is the case of HKU3-13 (h-BiP = 0.2), which is the only known member from the HKU3 clade experimentally confirmed for human binding in our dataset. Interestingly, HKU3-8 has been reported not to bind human receptors 38 .
We also tested the proposed method under the scenario for the emergence of a novel virus. In particular, we asked whether h-BiP would predict the binding of a novel coronavirus such as SARS-CoV-2. In the absence of all SARS-CoV-2 viruses and any viral sequence uploaded after its publication, the h-BiP score for the wild type of SARS-CoV-2 was 0.62, demonstrating that h-BiP may be a valuable tool to detect the potential of a virus to cross species and originate an epidemic.

Methods
Datasets. On 11/05/2020, we downloaded 28,368 RNA spike protein sequences of all alpha and beta coronaviruses from the NCBI Virus 49 database. Compared to the number of full nucleotide sequences, the number of annotated sequences for the Sarbecovirus genus (excluding SARS-CoV-2) was limited. On 07/26/2021, we downloaded all available nucleotide sequences and extracted the S protein (see section "Extracting the S protein from full sequences" for details), expanding the data from 78 to 194 unique sequences from the Sarbecovirus genus (excluding SARS-CoV-2). To reduce the impact of an unbalanced dataset, we randomly removed 50% of SARS-CoV-2 (under-sampling), preserving reference sequences. The final alpha and beta coronavirus dataset consisted of 2,534 amino acid sequences. We curated the host field by combining information from isolation source, submission notes and the related publication. We also removed 7 viruses that were genetically modified (Accessions: FJ882951, FJ882957, HQ890538, FJ882942, HQ890534, MT782114, MT782115) as well as a draft virus from a pangolin (MT084071). Sequences in the data set were annotated as positive or negative for human binding. Human coronaviruses, and viruses from non-human animal hosts with published binding evidence to a human receptor, were labeled as positive for binding (n = 1735). Viruses from non-human hosts reported as unable to bind human receptors, and those for which the binding condition was unknown, were labeled as negative for binding (n = 799). A comprehensive list of viruses with confirmed binding status is available in Table 1. The final dataset, according to binding status, is available in Table 5.
Extracting the S protein from full sequences. On 07/26/2021, we downloaded all available nucleotide (nt) sequences from the Sarbecovirus genus (excluding SARS-CoV-2) for a total of 643 sequences with at least 1603 nt of length. The S protein contains two subunits: the S1 subunit containing the receptor-binding-domain, and the S2 subunit, which mediates membrane fusion 50,51 . Prior to SARS-CoV-2 emergence, a highly conserved domain across all coronaviruses (SFIEDLLFNKVTLADAGF, NCBI accession cl40439) was found in the S2 subunit 51 . In order to extract the S protein from the full RNA sequence, we first translated it into the three possible reading frames. Next, we located the conserved domain by searching for the "SFIEDLLFN" motif (allowing for at most one substitution). The final putative S protein was the sequence enclosed by the start and stop codons, which contained the "SFIEDLLFN" motif. A total of 554 S proteins (194 unique) were found and included in the final dataset. We tested different motif lengths and number of substitutions to locate the conserved domain, but 9 amino acids, and 1 substitution, were found to be optimal.
Sequence embedding. In order to generate n-dimensional vectors from the protein sequences, we first built "words" of 3 letters (trimers) with 3 contiguous amino acids. We considered all possible trimers by sliding a window one amino acid at a time as shown in Fig. 1. To produce trimer embeddings, we used a skip-gram with negative sampling (vector-size = 100, context-size = 25, negative samples = 1) as in 52. Finally, each sequence was represented as the sum of vectors from all of its trimers. That is, if a particular sequence (seq k ) consisted of n trimers {w 1 , w 2 , …, w n }, and each trimer was represented by a 100-dimensional vector w i = [x i_1 , x i_2 , …, x i_100 ], then the final sequence embedding was calculated from Eq. (1).
(1) seq k = n i=1 w i www.nature.com/scientificreports/ Classification. In order to classify the sequences according to their potential to bind a human receptor, we labeled them as positive or negative for human binding as described in "Datasets" section. Normalized sequence embeddings of the training set were used to create a logistic regression model. The model used a ridge classifier with C = 1.5 inverse regularization strength and the limited memory BFGS (L-BFGS) optimization algorithm. A virus was classified as positive for binding if the score was 0.5 or higher. The final classifier was invariant to the threshold value (see Supplementary Fig. S1).
Phylogenetic tree. A phylogenetic tree for the S protein of alpha and beta coronaviruses was generated from a subset of 424 sequences (out of 2534) from the original dataset. This subset includes all reference viruses from both genera, several members from each of the 7 human coronaviruses and all viruses from non-human hosts with positive or negative published human binding annotation (see Table 1). A comprehensive list is available in Supplementary Table S4. A multiple sequence alignment (msa) of all sequences in the subset was generated with BBMap 53 . We used BEAST 54 on the msa to reconstruct multiple phylogenetic trees with a log-normal distributed relaxed molecular clock and a non-parametric coalescent prior. The final tree shown in Supplementary Fig. S2 corresponds to the maximum-clade-credibility tree. We used iTOL 55 for tree visualization and annotation.
Candidate structures for molecular dynamics. The starting RBD structures for molecular dynamics (MD) simulations were obtained using PDB file 4QZV 24 of Ty-HKU4-hDPP4 complex. At the RBD, Bt133 differs from Ty-HKU4 in 13 residues. These mutations were added to PDB 4QZV 24 using YASARA 56 . The starting RBD structures for the RBD of LYRa3 and LYRa11 were generated using AlphaFold 57 (implemented within the ColabFold suite 58 ). Due to the absence of experimentally determined LYRa3-hACE2 and LYRa11-hACE2 complexes, we used the SARS-CoV-hACE2 complex (PDB 2AJF 59 ) as a template in AlphaFold. The resulting structures were aligned to SARS-CoV using the MUSTANG 60 method in YASARA 56 . These alignments were in strong agreement (< 2.0A) and used as starting structures in our simulations.

Molecular dynamics.
To simulate protein-protein interactions, we used the molecular-modelling package YASARA 56 to substitute individual residues and to search for minimum-energy conformations on the resulting modified candidate structures. For all structures, we carried out an energy minimization (EM) routine, which included steepest descent and simulated annealing minimization (until free energy stabilizes to within 50 J/ mol) to remove clashes. All MD simulations were run using the AMBER14 force field 61 for solute, GAFF2 62 and AM1BCC 63 for ligands and TIP3P 63 for water. The cutoff was 8 Å for Van der Waals forces (AMBER's default value 64 ), and no cutoff was applied for electrostatic forces (using the Particle Mesh Ewald algorithm 65 ). The equations of motion were integrated with a multiple timestep of 1.25 fs for bonded interactions and 2.5 fs for non-bonded interactions at T = 298 K and P = 1 atm (NPT ensemble) via algorithms described in 67. Prior to counting the RBD's hydrogen bonds and calculating the free energy, we carried out several pre-processing steps on the structure, including an optimization of the hydrogen-bonding network 66 to increase the solute stability and a pKa prediction to fine-tune the protonation states of protein residues at the chosen pH of 7.4 67 . Insertions and mutations were carried out using YASARA's BuildLoop and SwapRes commands 67 , respectively. Structure conformations from the simulations were collected every 100 ps after 4 ns of equilibration time as determined by the solute root mean square deviations (RMSDs) from the starting structure. For all bound structures, we ran the simulations for at least 10 ns post equilibrium and verified stability of time series for RBD-receptor hydrogen bond counts and root mean square deviation (RMSD) from these starting structures. Hydrogen bonds (H-bonds) were counted and tabulated using a distance and an angle approximation between donor and acceptor atoms as described in 66. It is important to note in this approach, salt bridges of proximate residues are effectively counted as H-bonds between basic side chain amide groups and acidic side chain carboxyl groups. Therefore, ionic interactions are also included in the H-bond count. In a previous publication, we show that the H-bond count, as defined here, correlates well with binding free energy estimates that were obtained using the molecular mechanics/generalized Born surface area method.

Average number of H-bonds.
In previous studies, we have shown that the average number of H-Bonds correlates well with binding energy 48 . Therefore, for each MD simulation, we recorded the number of H-bonds formed between protein-protein interactions (including ionic bonds) at each sampled conformation (snapshots). The average number of H-bonds was computed from all available sampled conformations after reaching equilibrium (4 ns). We performed three independent MD simulations from each complex and determined the grand average and standard error. Results are available in Table 4.

H-bond frequencies (%).
At each sampled conformation (snapshot) from a MD simulation, we tracked every H-bond (including ionic bonds) and recorded the participant amino acids from the ligand and receptor. We computed the frequency for each pair dividing the number of sampled conformations where the pair was present by the total number of sampled conformations (double bonds are not considered in this count) and multiplied by 100. H-bond frequencies (%) from every simulation are available in the Supplementary Information S1.
Prediction of binding affinity. To estimate the binding free energy for each of the RBD-receptor complexes we used the energy-minimized structures on HawkDock 47 and PRODIGY 46 servers. We selected the